Conversation

@smarterclayton commented Sep 25, 2025

In some benchmarking and test environments, dynamic prefill selection may be difficult, and random selection among a set of hosts is sufficient.

Add a new `--enable-prefiller-sampling` flag that instructs the sidecar to select a random prefill host from the provided list instead of the first one. Make the behavior opt-in to prevent users from accidentally depending on the new behavior, and keep the existing default behavior (first header value) consistent.

E.g.:

    curl -H 'x-prefiller-host-port: server1:8000' -H 'x-prefiller-host-port: server2:8000'

will randomly choose one of the two values.

This allows static test environments to use multiple hardcoded hosts for testing. A load balancer may still be desirable, but I am chasing an issue where using a load balanced prefiller group does not result in the correct serving behavior.

@smarterclayton (Author) commented:

This enables a configuration where `vllm bench serve` passes a static `x-prefiller-host-port` header (via the newly added `--header` support) for a benchmark run that approximates round-robin load balancing across DP>8 instances, with no dependencies other than a Kubernetes Service to balance between the decoders.
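A sketch of what such a benchmark invocation could look like. This is a config fragment, not a verified command line: it assumes the new `--header` flag accepts `'Name: value'` strings and may be repeated, and the host names and remaining arguments are placeholders.

```shell
# Illustrative only: assumes --header is repeatable; hosts are placeholders.
# The sidecar, started with --enable-prefiller-sampling, picks one of the
# supplied prefill hosts at random per request.
vllm bench serve \
  --header 'x-prefiller-host-port: server1:8000' \
  --header 'x-prefiller-host-port: server2:8000' \
  ...  # remaining benchmark arguments omitted
```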

@smarterclayton force-pushed the enable_prefiller_sampling branch from 1143dac to b49f6e3 on October 13, 2025 20:28
@elevran (Contributor) commented Oct 29, 2025

@smarterclayton as we've moved the routing sidecar code into llm-d-inference-scheduler, could you kindly close the PR here and move the code over to the new repo if needed?
Also note that there are lint errors (unused param in test code).

@smarterclayton (Author) commented:

Moved to llm-d/llm-d-inference-scheduler#404
